CS236781: Deep Learning on Computational Accelerators¶

Homework Assignment 1¶

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 Ofer Nissim 312367576 ofer.nissim@campus.technion.ac.il
Student 2 Lotan Amit 208800342 lotan.amit@campus.technion.ac.il

Introduction¶

In this first homework assignment we'll familiarize ourselves with PyTorch as a general-purpose tensor library with automatic gradient calculation capabilities. We'll use it to implement some traditional machine-learning algorithms and remind ourselves of basic machine-learning concepts such as the different data sets and their uses, model hyperparameters, cross-validation, loss functions and gradient derivation. We'll also familiarize ourselves with other highly useful Python machine-learning packages such as numpy, sklearn and pandas.

General Guidelines¶

  • Please read the getting started page on the course website. It explains how to setup, run and submit the assignment.
  • The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited at all (unless you wish to play around). The only exception is to fill your name(s) in the above cell before submission. Please do not remove sections or change the order of any cells.
  • All your code (and even answers to questions) should be written in the files within the python package corresponding to the assignment number (hw1, hw2, etc). You can of course use any editor or IDE to work on these files.

Contents¶

  • Part 1: Working with data in PyTorch
    • Datasets
    • Built-in Datasets and Transforms
    • DataLoaders and Samplers
    • Training, Validation and Test Sets
  • Part 2: Nearest-neighbor classification
    • kNN Classification
    • Cross-validation
  • Part 3: Multiclass linear classification
    • Linear Classification
    • Loss Functions
    • Optimizing a Loss Function with Gradient Descent
    • Training the model with SGD
  • Part 4: Linear Regression
    • Dataset exploration
    • Linear Regression Model
    • Adding nonlinear features
    • Generalization
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \renewcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \renewcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 1: Working with data in PyTorch¶

In this part, we'll learn about the Dataset and DataLoader classes which are part of PyTorch's torch.utils.data package. These are highly useful abstractions that can greatly reduce the amount of boilerplate code you need to write in order to work with data. Knowing how to use these classes properly will prove useful in the coming assignments and the course project.

In [1]:
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest

%matplotlib inline
%load_ext autoreload
%autoreload 2

plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(42)
test = unittest.TestCase()

Datasets¶

The Dataset class is an abstraction over a sequence of python objects, each representing a sample (with or without a label). Its main purpose is to load a single (possibly labeled) sample from some source (disk, web, etc.) into memory and transform it into a usable representation (e.g. image to tensor).

The Dataset abstracts away exactly when the data is loaded into memory: It can be on demand when each sample is accessed, all in advance or some combination using e.g. caching. This is implementation-specific.
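To make the interface concrete, here is a tiny self-contained sketch (not part of the assignment code) of a map-style dataset: the only methods it must implement are `__getitem__` and `__len__`, and samples can be generated lazily on access.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the 1D tensor [i], labeled i**2 % 3."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)  # out-of-range access should fail loudly
        return torch.tensor([float(idx)]), (idx ** 2) % 3

ds = SquaresDataset(10)
x, y = ds[4]  # generated on demand, nothing held in memory up front
```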

As a warm-up, let's create a demonstration Dataset that returns noise images. It should:

  • Return tensors of shape (C, W, H) containing random contents.
  • Label each returned tensor with a class label, an integer between 0 and num_classes-1.
  • Initialize each returned tensor with a uniform distribution on [0, 255].
  • Return a total of num_samples labeled images.
  • The same image should be returned every time the dataset is accessed at the same index.

First, let's implement a simple function to generate a labelled random image.

TODO Implement the random_labelled_image function in the hw1/datasets.py module. Use the code below to test your implementation.

In [2]:
import hw1.datasets as hw1datasets
import cs236781.plot as plot

image_shape = (3, 32, 64)
num_classes = 3
low, high = 0, 10

# Generate some random images and check values
X_ = None
for i in range(100):
    X, y = hw1datasets.random_labelled_image(image_shape, num_classes, low, high)
    test.assertEqual(X.shape, image_shape)
    test.assertIsInstance(y, int)
    test.assertTrue(0 <= y < num_classes)
    test.assertTrue(torch.all((X >= low) & (X < high)))
    if X_ is not None:
        test.assertFalse(torch.all(X == X_))
    X_ = X
    
plot.tensors_as_images([X, X_]);

In many cases we'll need to consistently get repeatable results even though we're using pseudo-random number generators (PRNGs). The way to do this is to provide a seed to the generator. Given the same seed, a PRNG will always generate the same sequence of numbers.

Here, we need a way to generate the same random image when accessing our dataset at the same index (e.g. to simulate a real set of images).
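One common way to implement such a temporary-seed context manager (a sketch of the idea, not necessarily identical to the hw1 implementation) is to snapshot the global RNG state, seed, and restore the state on exit:

```python
import contextlib
import torch

@contextlib.contextmanager
def temporary_seed(seed):
    """Seed the global torch RNG inside the context; restore the previous
    RNG state on exit (hypothetical helper illustrating the idea)."""
    state = torch.random.get_rng_state()  # snapshot current generator state
    try:
        torch.random.manual_seed(seed)
        yield
    finally:
        torch.random.set_rng_state(state)  # restore even if an error occurred

torch.random.manual_seed(0)
before = torch.randn(3)
with temporary_seed(1234):
    inside = torch.randn(3)  # deterministic given seed 1234
after = torch.randn(3)  # continues the seed-0 stream as if nothing happened
```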

TODO Implement the torch_temporary_seed function in the hw1/datasets.py module. Use the code below to test your implementation.

In [3]:
seeds = [42, 24]
torch.random.manual_seed(seeds[0])

# Before the context, the first seed affects the output
data_pre_context = torch.randn(100,)

with hw1datasets.torch_temporary_seed(seeds[1]):
    # Within this context, the second seed is in effect
    data_in_context = torch.randn(100,)
    
# After the context, the random state should be restored
data_post_context = torch.randn(100,)
data_around_context = torch.cat([data_pre_context, data_post_context])

# Use first seed, generate data in the same way but without changing context in the middle
torch.random.manual_seed(seeds[0])
data_no_context = torch.cat([torch.randn(100,), torch.randn(100,)])

# Identical results show that the context didn't affect external random state
test.assertTrue(torch.allclose(data_no_context, data_around_context))

# The data generated in the context should match what we would generate with the second seed
torch.random.manual_seed(seeds[1])
test.assertTrue(torch.allclose(data_in_context, torch.randn(100,)))

Now we can implement the dataset as required.

TODO Implement the RandomImageDataset class in the hw1/datasets.py module. Use the code below to test your implementation.

In [4]:
# Test RandomImageDataset

# Create the dataset
num_samples = 500
num_classes = 10
image_size = (3, 32, 32)
ds = hw1datasets.RandomImageDataset(num_samples, num_classes, *image_size)

# You can load individual items from the dataset by indexing
img0, cls0 = ds[139]

# Plot first N images from the dataset with a helper function
fig, axes = plot.dataset_first_n(ds, 9, show_classes=True, nrows=3)

# The same image should be returned every time the same index is accessed
for i in range(num_samples):
    X, y = ds[i]
    X_, y_ = ds[i]
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)
    test.assertEqual(y, y_)
    test.assertTrue(torch.all(X==X_))
    
# Should raise if out of range
for i in range(num_samples, num_samples+10):
    with test.assertRaises(ValueError):
        ds[i]

This simple dataset is a useful abstraction when we know in advance the number of samples in our dataset and can access them by indexing. However, in many cases we simply cannot know about all data in advance. For example, perhaps new data is generated in real time.

To deal with these cases, we can use a different type of abstraction: an IterableDataset which provides an interface only to iterate over samples, but not to index them directly. Let's implement such a dataset which will allow us to iterate over an infinite stream of randomly-generated images.
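For illustration, a minimal IterableDataset (a toy sketch, separate from the `ImageStreamDataset` you'll implement) only needs an `__iter__` method that yields samples:

```python
import torch
from torch.utils.data import IterableDataset

class NoiseStream(IterableDataset):
    """Endless stream of labeled noise images; has no __len__ or indexing."""
    def __init__(self, num_classes, *image_shape):
        self.num_classes = num_classes
        self.image_shape = image_shape

    def __iter__(self):
        while True:  # an infinite generator: consumers decide when to stop
            x = torch.rand(*self.image_shape)
            y = int(torch.randint(self.num_classes, (1,)).item())
            yield x, y

stream = NoiseStream(10, 3, 8, 8)
x, y = next(iter(stream))
```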

In [5]:
ds = hw1datasets.ImageStreamDataset(num_classes, *image_size)

# This dataset can't be indexed
with test.assertRaises(NotImplementedError):
    ds[0]
    
# There is no length
with test.assertRaises(TypeError):
    len(ds)
    
# Arbitrarily stop somewhere
stop = torch.randint(2**11, 2**16, (1,)).item()
    
# We can iterate over it, indefinitely
for i, (X, y) in enumerate(ds):
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)

    if i > stop:
        break

print(f'Generated {i} images')
test.assertGreater(i, stop)
Generated 22112 images

Built-in Datasets and Transforms¶

Now that we've created a simple Dataset to see how they work, we'll load one of PyTorch's built-in datasets: CIFAR-10. This is a famous dataset consisting of 60,000 small 32x32 color images classified into 10 classes. You can read more about it here.

The torchvision package has built-in Dataset classes that can download the data to a local folder, load it, transform it using arbitrary transform functions and iterate over the resulting samples.

Run the following code block to download CIFAR-10, create a Dataset from it, and plot some random images. The data won't be downloaded again if already present.

In [6]:
import os
import torchvision
import torchvision.transforms as tvtf

cfar10_labels = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
data_root = os.path.expanduser('~/.pytorch-datasets')

cifar10_train_ds = torchvision.datasets.CIFAR10(
    root=data_root, download=True, train=True,
    transform=tvtf.ToTensor()
)

print('Number of samples:', len(cifar10_train_ds))

# Plot them with a helper function
fig, axes = plot.dataset_first_n(cifar10_train_ds, 64,
                                 show_classes=True, class_labels=cfar10_labels,
                                 nrows=8, hspace=0.5)
Files already downloaded and verified
Number of samples: 50000

Now that we've loaded the entire CIFAR-10 dataset, we would like to work with a smaller subset of it to reduce the runtime of the code in this notebook. A simple way to achieve this is to wrap a Dataset in another Dataset that does this for us. This will make it easy to use our subset with DataLoaders, as you will see later.
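The wrapping idea can be sketched as follows; `SubsetView` here is a hypothetical stand-in for the `SubsetDataset` you'll implement, shown only to illustrate the delegation pattern:

```python
from torch.utils.data import Dataset

class SubsetView(Dataset):
    """A view of `length` consecutive items of `source`, starting at `offset`
    (hypothetical stand-in illustrating the idea behind SubsetDataset)."""
    def __init__(self, source, length, offset=0):
        assert offset + length <= len(source), "subset exceeds source bounds"
        self.source, self.length, self.offset = source, length, offset

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return self.source[self.offset + index]  # delegate to the wrapped dataset

base = [(i, i % 10) for i in range(100)]  # any indexable dataset works
sub = SubsetView(base, length=5, offset=10)
```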

TODO Complete the implementation of SubsetDataset in hw1/datasets.py and use the following code block to test.

In [7]:
subset_len = 5000
subset_offset = 1234
cifar10_train_subset_ds = hw1datasets.SubsetDataset(cifar10_train_ds, subset_len, subset_offset)

dataset_x, dataset_y  = cifar10_train_ds[subset_offset + 10]
subset_x, subset_y  = cifar10_train_subset_ds[10]

# Tests
test.assertEqual(len(cifar10_train_subset_ds), subset_len)
test.assertTrue(torch.all(dataset_x == subset_x))
test.assertEqual(dataset_y, subset_y)
with test.assertRaises(IndexError, msg="Out of bounds index should raise IndexError"):
    tmp = cifar10_train_subset_ds[subset_len]

Notice that when we initialized the Dataset instance for CIFAR-10, we provided a transform parameter. This is a way to specify an arbitrary transformation that should be run on each sample prior to returning it from the dataset.

In the above, we used the ToTensor() transformation from torchvision.transforms to convert the images from a PIL (Python Imaging Library) image object which has a shape of 32x32x3 and values in range [0, 255] into a pytorch Tensor of shape 3x32x32 and values in range [0, 1].

To demonstrate the use of transforms, we'll implement two custom transforms which invert the colors and flip the images around the horizontal axis.
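Assuming the inputs are (C,H,W) tensors with values in [0,1] (as produced by ToTensor), the two transforms could be sketched roughly like this (an illustration, not necessarily the hw1 solution):

```python
import torch

class InvertColors:
    """Invert a [0,1]-valued image tensor: x -> 1 - x (illustrative sketch)."""
    def __call__(self, x):
        return 1.0 - x

class FlipUpDown:
    """Flip a (C,H,W) image tensor vertically, i.e. along the H axis (dim 1)."""
    def __call__(self, x):
        return torch.flip(x, dims=[1])

x = torch.tensor([[[0.0, 0.25],
                   [1.0, 0.50]]])  # a tiny (1,2,2) "image"
inverted = InvertColors()(x)
flipped = FlipUpDown()(x)
```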

TODO Complete the InvertColors and FlipUpDown classes in the hw1/transforms.py module.

In [8]:
import hw1.transforms as hw1transforms

cifar10_inverted_ds = torchvision.datasets.CIFAR10(
    root=data_root, download=True, train=True,
    transform=tvtf.Compose([ # Compose allows us to chain multiple transforms in a sequence
        tvtf.ToTensor(), # Convert PIL image to pytorch Tensor (C,H,W) in range [0,1]
        hw1transforms.InvertColors(),
        hw1transforms.FlipUpDown(),
    ])
)

fig, axes = plot.dataset_first_n(cifar10_inverted_ds, 64,
                                 show_classes=True, class_labels=cfar10_labels,
                                 nrows=8, hspace=0.5)

test.assertTrue(torch.allclose(cifar10_train_ds[0][0], torch.flip(1.-cifar10_inverted_ds[0][0], [1])),
               "Wrong custom transform")
Files already downloaded and verified

DataLoaders and Samplers¶

We have seen that a Dataset is simply an iterable allowing us to iterate over samples, and possibly also to access them by index. Simple to implement, but not very powerful. The real benefit comes when combining it with a DataLoader. A DataLoader samples a batch of samples from the dataset according to logic defined by a Sampler object. The sampler decides how to partition the dataset into batches of N samples. The DataLoader additionally handles loading samples in parallel to speed up creation of a batch.

A major motivation here is memory usage. When combining a DataLoader with a Dataset we can easily control memory constraints by simply setting the batch size. This is important since large datasets (e.g. ImageNet) do not fit in memory of most machines. Since a Dataset can lazily load samples from disk on access, and the DataLoader can sample random samples from it in parallel, we are provided with a simple yet high-performance mechanism to iterate over random batches from our dataset without needing to hold all of it in memory.

Let's create a basic DataLoader for our CIFAR-10 dataset. Run the following code block multiple times and observe that different samples are shown each time in the first few batches.

In [9]:
# Create a simple DataLoader that partitions the data into batches
# of size N=8 in random order, using two background processes
cifar10_train_dl = torch.utils.data.DataLoader(
    cifar10_train_ds, batch_size=8, shuffle=True, num_workers=2
)

# Iterate over batches sampled with our DataLoader
num_batches_to_show = 5
for idx, (images, classes) in enumerate(cifar10_train_dl):
    # The DataLoader returns a tuple of:
    # images: Tensor of size NxCxHxW
    # classes: Tensor of size N
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break

Here, we specified shuffle=True to the DataLoader. This automatically creates a Sampler which simply returns indices from the Dataset in random order.

To better control the content of the batches, we can create our own custom sampler. Imagine we want each batch to contain one sample from the beginning of the dataset and another from the end. If we have N samples, we would like to get the following sequence of indices: [0, N-1, 1, N-2, 2, N-3, ...] and then use a batch_size of 2.
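The index order described above can be generated with a simple two-pointer sweep; this standalone sketch (not the Sampler class itself) illustrates the logic:

```python
def first_last_order(n):
    """Yield 0, n-1, 1, n-2, ... via two pointers sweeping inward."""
    lo, hi = 0, n - 1
    while lo < hi:
        yield lo
        yield hi
        lo, hi = lo + 1, hi - 1
    if lo == hi:  # leftover middle element when n is odd
        yield lo
```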

TODO Implement the FirstLastSampler class in the hw1/dataloaders.py module.

In [10]:
import hw1.dataloaders as hw1dataloaders

# Test sampler with odd number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(5)))
test.assertEqual(list(sampler), [0,4, 1,3, 2,])

# Test sampler with even number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(6)))
test.assertEqual(list(sampler), [0,5, 1,4, 2,3])


# Create a DataLoader that partitions the data into batches
# of size N=2 in an order determined by our custom sampler
cifar10_train_dl = torch.utils.data.DataLoader(
    cifar10_train_ds, batch_size=2, num_workers=0,
    sampler=hw1dataloaders.FirstLastSampler(cifar10_train_ds),
)

# Iterate over batches sampled with our DataLoader
num_batches_to_show = 3
for idx, (images, classes) in enumerate(cifar10_train_dl):
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break

Training, Validation and Test Sets¶

Now that we know about DataLoaders we can use them to do something useful: split a training dataset into Training and Validation sets.

A common issue in machine learning is the abundance of hyperparameters that must be selected prior to training the model on data. These hyperparameters may be part of the model itself or part of the training process. We would like to determine which hyperparameter selection best fits the training data and, more importantly, generalizes to unseen data.

A prevalent approach is therefore to split the training dataset into two parts: One for actual training, i.e. tuning model parameters e.g. weights in the case of neural nets, and another for validation, i.e. comparing one model or set of hyperparameters to another. After the best model is selected (by seeking the minimal validation error), it can be retrained with the entire training set.
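One way to realize such a split with DataLoaders (a sketch under the assumption that `SubsetRandomSampler` is used; the hw1 function may differ) is to shuffle the indices once and build one loader per disjoint index subset:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

def train_validation_loaders(dataset, validation_ratio, batch_size=32):
    """Split dataset indices into two disjoint random subsets and build one
    DataLoader per subset (hypothetical helper; the hw1 API may differ)."""
    n = len(dataset)
    n_valid = int(validation_ratio * n)
    idx = torch.randperm(n).tolist()  # shuffle indices once, then split
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    dl_train = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(train_idx))
    dl_valid = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(valid_idx))
    return dl_train, dl_valid

toy_ds = [(torch.tensor([float(i)]), i % 2) for i in range(100)]
dl_t, dl_v = train_validation_loaders(toy_ds, validation_ratio=0.2)
```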


TODO Implement the function create_train_validation_loaders in the hw1/dataloaders.py module. Use the following code block to check your implementation.

In [11]:
# Testing the train/validation split dataloaders
import hw1.dataloaders as hw1dataloaders

validation_ratio = 0.2
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(cifar10_train_ds, validation_ratio)

train_idx = set(dl_train.sampler.indices)
valid_idx = set(dl_valid.sampler.indices)
train_size = len(train_idx)
valid_size = len(valid_idx)
print('Training set size: ', train_size)
print('Validation set size: ', valid_size)

# Tests
test.assertEqual(train_size+valid_size, len(cifar10_train_ds), "Incorrect total number of samples")
test.assertEqual(valid_size, validation_ratio * (train_size + valid_size), "Incorrect ratio")
test.assertTrue(train_idx.isdisjoint(valid_idx), "Train and validation sets are not disjoint")
Training set size:  40000
Validation set size:  10000

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.

In [12]:
from cs236781.answers import display_answer
import hw1.answers

Question 1¶

Determine whether each of the following statements is true or false, and explain why in detail:

  1. The test set allows us to estimate our in-sample error.
  2. Any split of the data into two disjoint subsets would constitute an equally useful train-test split.
  3. The test-set should not be used during cross-validation.
  4. After performing cross-validation, we use the validation-set performance of each fold as a proxy for the model's generalization error.
In [13]:
display_answer(hw1.answers.part1_q1)

Your answer:

  1. *False*

The test set allows us to estimate the out-of-sample loss (the theoretical loss). The train set allows us to calculate the in-sample loss.

  2. *False*

First, for technical reasons: if the split makes the train set too small, it won't be useful. Also, if one of the classes has a very low frequency (for example, a rare disease we are trying to detect), then some splits won't contain any samples with that class's label and thus won't be useful.

  3. *True*

Using the test set during cross-validation is forbidden! It causes data leakage: the model effectively learns from the test set, making the test performance invalid. The test set should only be used after the model is fully trained.

  4. *False*

We use the test-set performance as the proxy for the model's generalization error, not the validation-set performance.

Question 2¶

Your friend has trained a simple linear regression model, e.g. $\hat{y}=\vectr{w}\vec{x}+b$, with some training data. He then evaluated it on a disjoint test-set and concluded that the model has over-fit the training set and therefore decided to add a regularization term $\lambda \norm{\vec{w}}^2$ to the loss, where $\lambda$ is a hyperparameter. In order to select the value of $\lambda$, your friend re-trained the model on his training set with different values of $\lambda$ and then chose the value which produced the best results on the test set.

Is your friend's approach justified? Explain why or why not.

In [14]:
display_answer(hw1.answers.part1_q2)

Your answer:

Our friend's approach is not justified. It manually creates data leakage: hyperparameter tuning should be performed with a validation set, not with the test set.

Part 2: Nearest-neighbor classification¶

In this part, we'll familiarize ourselves with the PyTorch tensor API by implementing a very simple classifier, kNN, using efficient, vectorized tensor operations alone. We'll then implement cross-validation, an important ML technique used to find suitable values for a model's hyperparameters.

In [1]:
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest

%matplotlib inline
%load_ext autoreload
%autoreload 2

plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()

kNN Classification¶

Arguably the most basic classification scheme in a supervised learning setting is the k nearest-neighbor (kNN) classifier. Given a training data set, kNN's "training" phase consists of simply memorizing it. When a classification of an unseen sample is required, some distance metric (e.g. Euclidean) is computed between it and all training samples. The unseen sample is then classified according to the majority label of its k nearest neighbors.
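The prediction step described above can be sketched with vectorized tensor operations; `knn_predict` is a hypothetical helper, shown only to illustrate the majority-vote logic:

```python
import torch

def knn_predict(dists, y_train, k):
    """Majority vote over the k nearest training samples per test sample.
    `dists` is an (n_test, n_train) distance matrix (hypothetical helper)."""
    _, nn_idx = dists.topk(k, dim=1, largest=False)  # k smallest distances
    nn_labels = y_train[nn_idx]                      # shape (n_test, k)
    return torch.mode(nn_labels, dim=1).values       # per-row majority label

dists = torch.tensor([[0.1, 0.2, 5.0, 6.0],
                      [4.0, 3.0, 0.3, 0.2]])
y_train = torch.tensor([0, 0, 1, 1])
pred = knn_predict(dists, y_train, k=3)  # row 0 -> label 0, row 1 -> label 1
```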

Here we'll implement the most basic kNN, working directly on image pixel values and computing the L2 distance between a test image and every training image. We'll use data from the MNIST database of handwritten digits. This database contains single-channel images with a constant black background, and the digits are roughly the same size, which makes it feasible to obtain reasonable classification accuracy even with such a naïve model.

Note however that real-world kNN models are often implemented with tree-based data structures to find nearest neighbors in logarithmic time, with specialized distance functions, and using image features instead of raw pixels.

TODO Implement the TensorView transform in the hw1/transforms module, and run the following code to load the data we'll work with.

In [2]:
# Prepare data for kNN Classifier
import torchvision.transforms as tvtf

import cs236781.dataloader_utils as dataloader_utils
import hw1.datasets as hw1datasets
import hw1.transforms as hw1tf

# Define the transforms that should be applied to each MNIST image before returning it
tf_ds = tvtf.Compose([
    tvtf.ToTensor(), # Convert PIL image to pytorch Tensor
    hw1tf.TensorView(-1), # Reshape to 1D Tensor
])

# Define how much data to load (only use a subset for speed)
num_train = 10000
num_test = 1000
batch_size = 1024

# Training dataset & loader
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds), num_train)
dl_train = torch.utils.data.DataLoader(ds_train, batch_size)

# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds), num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)

# Get all test data
x_test, y_test = dataloader_utils.flatten(dl_test)

TODO Implement the l2_dist function in the hw1/knn_classifier.py module. This is the core of the kNN algorithm. You'll need to use broadcasting to implement it in an efficient, vectorized way (without loops).
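One common loop-free approach (a sketch; your implementation may instead use plain broadcasting of the difference) relies on expanding the squared norm, $\norm{\vec{a}-\vec{b}}^2 = \norm{\vec{a}}^2 + \norm{\vec{b}}^2 - 2\vec{a}\cdot\vec{b}$:

```python
import torch

def l2_dist_vectorized(x1, x2):
    """All-pairs Euclidean distance via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    (one possible loop-free formulation, for illustration)."""
    sq1 = (x1 ** 2).sum(dim=1, keepdim=True)    # (N1, 1) squared norms
    sq2 = (x2 ** 2).sum(dim=1, keepdim=True).T  # (1, N2) squared norms
    cross = x1 @ x2.T                           # (N1, N2) inner products
    d2 = (sq1 + sq2 - 2 * cross).clamp(min=0)   # clamp guards tiny negatives
    return torch.sqrt(d2)

a = torch.randn(12, 34)
b = torch.randn(45, 34)
dists = l2_dist_vectorized(a, b)
```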

In [3]:
import itertools as it
import hw1.knn_classifier as hw1knn

def l2_dist_naive(x1, x2):
    """
    Naive distance calculation, just for testing.
    Super slow, don't use!
    """
    dists = torch.empty(x1.shape[0], x2.shape[0], dtype=torch.float)
    for i, j in it.product(range(x1.shape[0]), range(x2.shape[0])):
        dists[i,j] = torch.sum((x1[i] - x2[j])**2).item()
    return torch.sqrt(dists)


# Test distance calculation
x1 = torch.randn(12, 34)
x2 = torch.randn(45, 34)

dists = hw1knn.l2_dist(x1, x2)
dists_naive = l2_dist_naive(x1, x2)

test.assertTrue(torch.allclose(dists, dists_naive), msg="Wrong distances")

TODO Implement the accuracy function in the hw1/knn_classifier.py module. This will be our score. It will simply return the fraction of predictions that are correct.

In [4]:
y1 = torch.tensor([0, 1, 2, 3])
y2 = torch.tensor([2, 2, 2, 2])

test.assertEqual(hw1knn.accuracy(y1, y2), 0.25)

TODO Complete the implementation of the KNNClassifier class in the module hw1/knn_classifier.py:

  1. Implement the kNN "training" in the train() method.
  2. Implement label prediction in the predict() method.

Use the following code to test your implementations.

In [5]:
# Test kNN Classifier
knn_classifier = hw1knn.KNNClassifier(k=10)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)

# Calculate accuracy
accuracy = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy*100:.2f}%')

# Sanity check: at least 80% accuracy
test.assertGreater(accuracy, 0.8)
Accuracy: 80.50%

Cross-validation¶

A common way to choose hyperparameters for a model, or even the model itself, is by applying K-fold cross-validation (CV). For each candidate set of hyperparameters, the model is trained K times, each time with a different split of the training data into train and validation sets (called a fold). The set of hyperparameters which resulted in the lowest average validation error rate is selected.

More specifically, K-fold CV is usually performed as follows:

  1. For all choices of a model and/or set of hyperparameters for the model:
    1. Split training set into K non-overlapping parts.
    2. For k=0,...,K-1:
      1. Select the k-th part as the validation set and the remaining K-1 parts as the training set.
      2. Train the current model on the current training set.
      3. Evaluate the model on the current validation set to obtain its validation error.
    3. Calculate the current model's average validation error across the K folds.
  2. Select the model with the lowest average validation error.
  3. Train the selected model with the entire training set.
  4. Evaluate the model with the test set.
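The fold-splitting at the heart of the procedure above boils down to partitioning indices; a standalone sketch (illustration only, not the hw1 implementation):

```python
import torch

def kfold_indices(n, num_folds):
    """For each fold k, yield (train_idx, valid_idx): the k-th contiguous
    chunk is validation, the remaining chunks form the training set."""
    folds = torch.chunk(torch.arange(n), num_folds)
    for k in range(num_folds):
        train_idx = torch.cat([folds[j] for j in range(num_folds) if j != k])
        yield train_idx, folds[k]

splits = list(kfold_indices(10, 5))
```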

Now we would like to find the best value of k for applying our kNN model to MNIST. In this case we have already fixed the model and there is only one hyperparameter: the value of k (not to be confused with K, the number of folds for the cross-validation).

TODO Complete the implementation of the find_best_k function in the knn_classifier.py module.

In [6]:
num_folds = 4
k_choices = [1, 3, 5, 8, 12, 20, 50]

# Run cross-validation
best_k, accuracies = hw1knn.find_best_k(ds_train, k_choices, num_folds)
In [7]:
# Plot accuracies per k
_, ax = plt.subplots(figsize=(12,6), subplot_kw=dict(xticks=k_choices))
for i, k in enumerate(k_choices):
    curr_accuracies = accuracies[i]
    ax.scatter([k] * len(curr_accuracies), curr_accuracies)

accuracies_mean = np.array([np.mean(accs) for accs in accuracies])
accuracies_std = np.array([np.std(accs) for accs in accuracies])
ax.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
ax.set_title(f'{num_folds}-fold Cross-validation on k')
ax.set_xlabel('k')
ax.set_ylabel('Accuracy')

print('best_k =', best_k)
best_k = 3

Now that we found our best_k, we can train the model with that value of k on the full training set and evaluate the accuracy on the test set:

In [8]:
knn_classifier = hw1knn.KNNClassifier(k=best_k)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)

# Calculate accuracy
accuracy_best_k = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy_best_k*100:.2f}%')

test.assertGreater(accuracy_best_k, accuracy)
Accuracy: 81.70%

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.

In [9]:
from cs236781.answers import display_answer
import hw1.answers

Question 1¶

Does increasing k lead to improved generalization for unseen data? Why or why not? Up to what point? Think about the extremal values of k.

In [10]:
display_answer(hw1.answers.part2_q1)

Your answer:

Increasing k should improve the model's test accuracy at first, but past a certain threshold the accuracy will start to drop again. This is because for k=1 we get an over-fit model, which classifies a sample by only its closest neighbor, while for $k\rightarrow\infty$ (when k approaches the size of the train set) we get an under-fit model whose decision rule degenerates to predicting the majority label of the train set.

Question 2¶

Explain why (i.e. in what sense) using k-fold CV, as detailed above, is better than:

  1. Training on the entire train-set with various models and selecting the best model with respect to train-set accuracy.
  2. Training on the entire train-set with various models and selecting the best model with respect to test-set accuracy.
In [11]:
display_answer(hw1.answers.part2_q2)

Your answer:

  1. The kNN model with the highest train-set accuracy will always be 1-NN, because each sample is its own nearest neighbor.
  2. Selecting the model based on its test-set accuracy is also data leakage, as we saw before in Part 1.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \renewcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \renewcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 3: Multiclass linear classification¶

In this part we'll learn about loss functions and how to optimize them with gradient descent. We'll then use this knowledge to train a very simple model: a linear SVM.

In [1]:
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest

%matplotlib inline
%load_ext autoreload
%autoreload 2

plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()
In [2]:
s_j = torch.Tensor([1,2,3]).view(1,3)
s_i = torch.Tensor([1,2,3]).view(3,1)

print(s_j-s_i)
tensor([[ 0.,  1.,  2.],
        [-1.,  0.,  1.],
        [-2., -1.,  0.]])

Linear Classification¶

In multi-class linear classification we have $C$ classes which we assume our samples may belong to. We apply a linear function to a sample $\vec{x} \in \set{R}^{D}$ and obtain a score $s_j$ which represents how well $\vec{x}$ fits class $j$ ($1\leq j\leq C$) according to our model: $$ s_j = \vec{w_j} \vec{x} + b_j. $$

Note that we have a different set of model parameters (weights) $\vec{w_j},~b_j$ for each class, so a total of $C\cdot(D+1)$ parameters.

To classify a sample, we simply calculate the score for each class and choose the class with the highest score as our prediction.

One interpretation of the weights $\vec{w_j},~b_j$ is that they represent the parameters of a hyperplane in the $D$-dimensional feature space. Under this interpretation, the class score $s_j$ of a sample is proportional to the (signed) distance of that sample from the hyperplane representing the $j$-th class. Note that this score can be positive or negative (depending on which side of the hyperplane the sample is on). Such a classifier therefore splits the sample space into regions: the farther a sample is on the positive side of the hyperplane for class $j$, the higher $s_j$, and so the more likely it belongs to class $j$.

Implementation¶

In the context of supervised learning of a linear classifier model, we map a dataset (or a batch from a dataset) of $N$ samples (for example, images flattened to vectors of length $D$) to a score for each of $C$ possible classes using the linear function above.

To make the implementation efficient, we'll represent the mapping as a single matrix multiplication, employing the "Bias trick": instead of keeping both $\vec{w_j}$ and $b_j$ per class, we'll put the bias term at the beginning of the weight vector and prepend a constant $1$ to each sample.

The class scores for each sample are then given by:

$$ \mat{S} = \mat{X} \mat{W} $$

Where here (and in the code examples you'll work with),

  • $\mat{X}$ is a matrix of shape $N\times (D+1)$ containing the $N$ samples in its rows;
  • $\mat{W}$ is of shape $(D+1)\times C$ and contains the learnable classifier parameters (weights and bias);
  • $\mat{S}$ is therefore an $N\times C$ matrix of the class scores of each sample.

Notes:

  1. In the following discussions we'll use the notation $\vec{x_i}$ to denote the $i$-th training sample (row $i$ in $\mat{X}$) and $\vec{w_j}$ to denote the weights and bias for class $j$ (column $j$ in $\mat{W}$). However, when writing explicit vectors we treat them all as columns, so e.g. $\vectr{w_j}\vec{x_i}$ is an inner product.
  2. The reason we put the samples in the rows of $\mat{X}$ and not columns (as is the convention in some texts) is because that's the convention in the pytorch library: the batch dimension is always the first one. This has many implementation advantages.
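The bias trick and the score computation can be sketched as follows (a minimal standalone example with illustrative sizes; it is not the required `BiasTrick` implementation):

```python
import torch

N, D, C = 4, 3, 5                            # illustrative sizes
X = torch.randn(N, D)                        # a batch of N samples

# Bias trick: prepend a constant-1 feature to every sample
ones = torch.ones(*X.shape[:-1], 1, dtype=X.dtype)
Xb = torch.cat([ones, X], dim=-1)            # shape (N, D+1)

# One weight column per class; row 0 plays the role of the biases b_j
W = torch.randn(D + 1, C)

S = Xb @ W                                   # class scores, shape (N, C)
y_pred = S.argmax(dim=1)                     # predicted class per sample
```

Note that operating on `X.shape[:-1]` (rather than assuming two dimensions) lets the same trick work for tensors of any rank.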

TODO Implement the BiasTrick transform class in the module hw1/transforms.py.

In [3]:
import hw1.transforms as hw1tf

tf_btrick = hw1tf.BiasTrick()

test_cases = [
    torch.randn(64, 512),
    torch.randn(2, 3, 4, 5, 6, 7),
    torch.randint(low=0, high=10, size=(1, 12)),
    torch.tensor([10, 11, 12])
]

for x_test in test_cases:
    xb = tf_btrick(x_test)
    print('shape =', xb.shape)
    test.assertEqual(x_test.dtype, xb.dtype, "Wrong dtype")
    test.assertTrue(torch.all(xb[..., 1:] == x_test), "Original features destroyed")
    test.assertTrue(torch.all(xb[..., [0]] == torch.ones(*xb.shape[:-1], 1)), "First feature is not equal to 1")
shape = torch.Size([64, 513])
shape = torch.Size([2, 3, 4, 5, 6, 8])
shape = torch.Size([1, 13])
shape = torch.Size([4])
In [4]:
import torchvision.transforms as tvtf

# Define the transforms that should be applied to each image in the dataset before returning it
tf_ds = tvtf.Compose([
    # Convert PIL image to pytorch Tensor
    tvtf.ToTensor(),
    # Normalize each chanel with precomputed mean and std of the train set
    tvtf.Normalize(mean=(0.1307,), std=(0.3081,)),
    # Reshape to 1D Tensor
    hw1tf.TensorView(-1), 
    # Apply the bias trick (add bias element to features)
    hw1tf.BiasTrick(),
])

The following code will use your transform to load a subset of the MNIST dataset for us to work with.

In [5]:
import hw1.datasets as hw1datasets
import hw1.dataloaders as hw1dataloaders

# Define how much data to load
num_train = 10000
num_test = 1000
batch_size = 1000

# Training dataset
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds),
    num_train)

# Create training & validation sets
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(
    ds_train, validation_ratio=0.2, batch_size=batch_size
)

# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds),
    num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)

x0, y0 = ds_train[0]
n_features = torch.numel(x0)
n_classes = 10

# Make sure samples have bias term added
test.assertEqual(n_features, 28*28*1+1, "Incorrect sample dimension")

TODO Complete the implementation of the __init()__, predict() and evaluate_accuracy() functions in the LinearClassifier class located in the hw1/linear_classifier.py module.

In [6]:
import hw1.linear_classifier as hw1linear

# Create a classifier
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)

# Evaluate accuracy on test set
mean_acc = 0
for (x,y) in dl_test:
    y_pred, _ = lin_cls.predict(x)
    mean_acc += lin_cls.evaluate_accuracy(y, y_pred)
mean_acc /= len(dl_test)

print(f"Accuracy: {mean_acc:.1f}%")
Accuracy: 6.4%

You should get an accuracy of around 10%, corresponding to a random guess of one of ten classes. You can run the above code block multiple times to sample different initial weights and get slightly different results.

Loss Functions¶

We have seen that a linear model computes class scores for each sample using a linear mapping as a score function. However, in order to train the model, we need some measure of how well we've classified our samples compared to their ground-truth labels. This measure is known as a loss function, and its selection is crucial in determining the model that will result from training. A loss function produces lower values the better the classification is.

Multiclass SVM loss function¶

A very common linear model for classification is the Support Vector Machine. An SVM attempts to find separating hyperplanes with a maximal margin to the training samples, i.e. hyperplanes that are as far as possible from the closest training samples. For example, in the following image we see a simple case with two classes of samples that have only two features. The data is linearly separable, and it's easy to see that there are infinitely many hyperplanes (in this case, lines) that separate the data perfectly.

The SVM model finds the optimal hyperplane, which is the one with the maximal margin. The data points closest to the separating hyperplane are called the Support Vectors (it can be shown that only they determine the hyperplane). We can see that the width of the margin is $\frac{2}{\norm{\vec{w}}}$. In this simple case since the data is linearly separable, there exists a solution where no samples fall within the margin. If the data is not linearly separable, we need to allow samples to enter the margin (with a cost). This is known as a soft-margin SVM.

svm

There are many ways to train an SVM model. Classically, the problem is stated as a constrained optimization problem and solved with quadratic-programming techniques. In this exercise, we'll instead work directly with the unconstrained SVM loss function, calculate its gradient analytically, and then minimize it with gradient descent. As we'll see in the rest of the course, this technique is a major component of training deep neural networks.

The in-sample (empirical) loss function for a multiclass soft-margin SVM can be stated as follows:

$$ L(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} L_{i}(\mat{W}) + \frac{\lambda}{2} \norm{\mat{W}}^2 $$

Where the first term is the mean pointwise data-dependent loss $L_{i}$, given by the hinge loss formula,

$$ L_{i}(\mat{W}) = \sum_{j \neq y_i} \max\left(0, \Delta+ \vectr{w_j} \vec{x_i} - \vectr{w_{y_i}} \vec{x_i}\right), $$

and the second term is a regularization loss which depends only on the model parameters. Note that the hinge-loss term sums over the scores of the incorrect classes for each sample ($j\neq y_i$, where $y_i$ is the ground-truth label of sample $i$). It can be understood as requiring that the score of the correct class be higher than that of every other class by at least some margin $\Delta > 0$; otherwise a loss is incurred. This way, samples are allowed to fall within the margin but incur a loss, which gives us a soft-margin SVM.

The regularization term penalizes large weight magnitudes to prevent ambiguous solutions: if, e.g., $\mat{W^*}$ is a weight matrix that perfectly separates the data, then so is $\alpha\mat{W^*}$ for any scalar $\alpha \geq 1$.

Fitting an SVM model then amounts to finding the weight matrix $\mat{W}$ which minimizes $L(\mat{W})$. Note that we're writing the loss as a function of $\mat{W}$ to emphasize that we wish to minimize its value with respect to the weights, even though it obviously also depends on our specific dataset, $\left\{ \vec{x_i}, y_i \right\}_{i=1}^{N}$.
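To make the formula concrete, here is a sketch of the full loss computed in a vectorized way on random data (illustrative only, with arbitrary $\Delta$ and $\lambda$; it is not the required `SVMHingeLoss` implementation, and for simplicity it regularizes all of $\mat{W}$, including the bias row):

```python
import torch

torch.manual_seed(0)
N, D1, C = 6, 4, 3                     # samples, features (with bias), classes
X = torch.randn(N, D1)                 # bias trick assumed already applied
W = torch.randn(D1, C)
y = torch.randint(0, C, (N,))          # ground-truth labels
delta, lmbda = 1.0, 0.1

S = X @ W                                    # scores, shape (N, C)
s_correct = S[torch.arange(N), y]            # score of the true class, (N,)
M = delta + S - s_correct.view(-1, 1)        # margins m_{i,j}, shape (N, C)
M[torch.arange(N), y] = 0                    # drop the j == y_i terms
data_loss = torch.clamp(M, min=0).sum(dim=1).mean()
reg_loss = 0.5 * lmbda * W.norm() ** 2
loss = data_loss + reg_loss
```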

Implementation¶

TODO Implement the SVM hinge loss function in the module hw1/losses.py, within the SVMHingeLoss class. Implement just the loss() function. For now you can ignore the part about saving tensors for the gradient calculation. Run the following to test.

In [7]:
import cs236781.dataloader_utils as dl_utils
from hw1.losses import SVMHingeLoss

torch.random.manual_seed(42)

# Classify all samples in the test set
# because it doesn't depend on randomness of train/valid split
x, y = dl_utils.flatten(dl_test)

# Compute predictions
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
y_pred, x_scores = lin_cls.predict(x)

# Calculate loss with our hinge-loss implementation
loss_fn = SVMHingeLoss(delta=1.)
batch_loss = loss_fn(x, y, x_scores, y_pred)

# Compare to pre-computed expected value as a test
expected_loss = 9.0233
print("loss =", batch_loss.item())
print('diff =', abs(batch_loss.item()-expected_loss))
test.assertAlmostEqual(batch_loss.item(), expected_loss, delta=1e-2)
loss = 9.023381233215332
diff = 8.12332153312667e-05

Optimizing a Loss Function with Gradient Descent¶

In this section we'll implement a simple gradient descent optimizer for the loss function we've implemented above. As you have seen in the lectures, the basic gradient-based optimization scheme is as follows:

  1. Start with initial model weights $\mat{W_0}$ initialized randomly.
  2. For $k=1,2,\dots,K$:
    1. Select a step size $\eta_k$.
    2. Compute the gradient of the loss w.r.t. $\mat{W}$ and evaluate at the current weights: $\nabla_{\mat{W}} L(\mat{W_{k-1}})$.
    3. Update: $$ \mat{W_k} = \mat{W_{k-1}} - \eta_k \nabla_{\mat{W}} L(\mat{W_{k-1}}) $$
    4. Stop if minimum reached or validation-set loss is low enough.
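The steps above can be sketched on a toy problem (an assumed example minimizing a simple quadratic rather than the SVM loss; the step size and iteration count are arbitrary):

```python
import torch

# Minimize f(w) = ||w - 3||^2 with a fixed step size
w = torch.randn(5)           # step 1: random initial weights
eta = 0.1                    # step 2.1: (constant) step size eta_k
for k in range(100):         # step 2: iterate
    grad = 2 * (w - 3.0)     # step 2.2: analytic gradient at the current w
    w = w - eta * grad       # step 2.3: update step
print(w)                     # approaches [3, 3, 3, 3, 3]
```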

The crucial component here is the gradient calculation. In this exercise we'll analytically derive the gradient of the loss and then implement it in code. In the next parts of the course we'll enjoy the automatic-differentiation features of PyTorch, but for now we'll do it the old-fashioned way.

An important detail to note is that while $L(\mat{W})$ is scalar-valued, it's a function of all the elements of the matrix $\mat{W}$. Therefore its gradient w.r.t. $\mat{W}$ is also a matrix, of the same shape as $\mat{W}$:

$$ \nabla_{\mat{W}} L = \begin{bmatrix} \frac{\partial L}{\partial W_{1,1}} & \cdots & \frac{\partial L}{\partial W_{1,C}} \\ \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial W_{D+1,1}} & \cdots & \frac{\partial L}{\partial W_{D+1,C}} \\ \end{bmatrix} = \begin{bmatrix} \vert & & \vert \\ \frac{\partial L}{\partial\vec{w_1}} & \cdots & \frac{\partial L}{\partial\vec{w_C}}\\ \vert & & \vert \\ \end{bmatrix} \in \set{R}^{(D+1)\times C}. $$

For our gradient descent update-step we'll need to create such a matrix of derivatives and evaluate it at the current value of the weight matrix.

SVM loss gradient¶

The first thing we need to do is formulate an expression for the gradient of the loss function defined above. Since the expression for the loss depends on the columns of $\mat{W}$, we'll derive an expression for the gradient of $L(\mat{W})$ w.r.t. each $\vec{w_j}$:

$$ \pderiv{L}{\vec{w_j}}(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} \pderiv{L_{i}}{\vec{w_j}}(\mat{W}) + \lambda \vec{w_j}. $$

To compute the gradient of the pointwise loss, let's define the margin-loss of sample $i$ for class $j$ as follows: $m_{i,j} = \Delta + \vectr{w_j}\vec{x_i} - \vectr{w_{y_i}}\vec{x_i}$. We can then write the pointwise loss and its gradient in terms of $m_{i,j}$. We'll separate out the case $j=y_i$ (i.e. the gradient for the correct class):

$$ \begin{align} \pderiv{L_i}{\vec{w_j}} & = \begin{cases} \vec{x_i}, & m_{i,j}>0 \\ 0, & \mathrm{else} \\ \end{cases} ,~j \neq y_i \\ \pderiv{L_i}{\vec{w_{y_i}}} & = -\vec{x_i} \sum_{j\neq y_i} \mathbb{1}\left( m_{i,j} > 0 \right) \end{align} $$

Where $\mathbb{1}(\cdot)$ is an indicator function that takes the value $1$ if its argument is a true statement, and $0$ otherwise.

Note: strictly speaking, the hinge-loss function is not differentiable, due to the $\max$ operator. In practice, however, this is not a major concern: since we know which argument the $\max$ "chooses", we can differentiate each case separately. The result is known as a sub-gradient. In the above, when $m_{i,j} \leq 0$ the gradient is simply zero.
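In vectorized form, the derivation above can be sketched like this (illustrative standalone code on random data; the actual `grad()` should reuse the tensors saved during `loss()`):

```python
import torch

torch.manual_seed(0)
N, D1, C = 6, 4, 3
X = torch.randn(N, D1)
W = torch.randn(D1, C)
y = torch.randint(0, C, (N,))
delta = 1.0

S = X @ W
M = delta + S - S[torch.arange(N), y].view(-1, 1)   # margins m_{i,j}
M[torch.arange(N), y] = 0                           # ignore j == y_i

# G[i, j] = 1(m_{i,j} > 0) for j != y_i, and
# G[i, y_i] = -sum_j 1(m_{i,j} > 0), exactly as derived above
G = (M > 0).to(X.dtype)
G[torch.arange(N), y] = -G.sum(dim=1)

grad = X.t() @ G / N                                # shape (D+1, C)
```

Each column $j$ of `grad` is then $\frac{1}{N}\sum_i \pderiv{L_i}{\vec{w_j}}$; the regularization gradient $\lambda\vec{w_j}$ is added separately during training.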

TODO Based on the above, implement the gradient of the loss function in the module hw1/losses.py, within the SVMHingeLoss class. Implement the grad() function and complete what's missing in the loss() function. Make sure you understand the above gradient derivation before attempting to implement it.

Note: you'll be implementing only the first term in the above equation for $\pderiv{L}{\vec{w_j}}(\mat{W})$. We'll add the regularization term later.

In [8]:
# Create a hinge-loss function
loss_fn = SVMHingeLoss(delta=1)

# Compute loss and gradient
batch_loss = loss_fn(x, y, x_scores, y_pred)
final_grad = loss_fn.grad()

# Sanity check only (not correctness): compare the shape of the gradient
test.assertEqual(final_grad.shape, lin_cls.weights.shape)

But in the above we only checked the shape, how do we know if we've implemented the gradient correctly?

One approach is to recall the formal definition of the derivative, i.e. $$ f'(x)=\lim_{h\to 0} \frac{f(x+h)-f(x)}{h}. $$ Another way to put this is that for a small enough $h$, $$ f(x+h)\approx f(x)+f'(x)\cdot h. $$

We can use this approach to implement a gradient check by applying very small perturbations of each weight (separately) and using the above formula to check the correctness of the gradient (up to some tolerance). This is called a numerical gradient check.
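Such a check can be sketched with a small hypothetical helper (shown here on a simple function rather than the SVM loss; the step size `h` and the tolerance are assumptions):

```python
import torch

def numerical_grad(f, w, h=1e-4):
    """Approximate df/dw element-wise with central differences."""
    grad = torch.zeros_like(w)
    flat_w, flat_g = w.view(-1), grad.view(-1)  # views share w's storage
    for i in range(flat_w.numel()):
        orig = flat_w[i].item()
        flat_w[i] = orig + h
        f_plus = f(w)
        flat_w[i] = orig - h
        f_minus = f(w)
        flat_w[i] = orig                        # restore the weight
        flat_g[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Check the analytic gradient of f(w) = sum(w^2), which is 2w
w = torch.randn(3, 2, dtype=torch.float64)
numeric = numerical_grad(lambda t: (t ** 2).sum().item(), w)
analytic = 2 * w
```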

Here we'll use a different approach, just to get a taste of the concept of automatic differentiation, which we'll rely on heavily in the rest of the course.

In the simple linear model we worked with, the gradient was fairly straightforward to derive analytically and implement. However for complex models such as deep neural networks with many layers and non-linear operations between them this is not the case. Additionally, the gradient must be re-derived any time either the model architecture or the loss function changes. These things make it infeasible in practice to perform deep-learning research using this manual method of gradient derivation. Therefore, all deep-learning frameworks provide a mechanism of automatic differentiation, to prevent the user from needing to manually derive the gradients of loss functions.

PyTorch provides this functionality in a package named torch.autograd which we will use further on in the next exercises. For now, here's an example showing that autograd can compute the gradient of the loss function you've implemented.

TODO Run the following code block. Try to understand how autograd is used and why. If the test fails, go back and fix your gradient calculation.

In [9]:
from hw1.losses import SVMHingeLoss

# Create a new classifier and loss function
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
loss_fn = SVMHingeLoss(delta=1)

# Specify we want the gradient to be saved for the weights tensor
# (just for our test)
lin_cls.weights.requires_grad = True

# Forward pass using the weights tensor, calculations will be tracked
y_pred, x_scores = lin_cls.predict(x)

# Compute loss of predictions and their analytic gradient
batch_loss = loss_fn(x, y, x_scores, y_pred)
final_grad = loss_fn.grad()

# Compute gradient with autograd
autograd_grad = torch.autograd.grad(batch_loss, lin_cls.weights)[0]

# Calculate the difference between analytic and autograd
diff = torch.norm(final_grad - autograd_grad).item()
print('loss =', batch_loss.item())
print('grad =\n', final_grad)
print('autograd =\n', autograd_grad)
print('diff =', diff)

test.assertLess(diff, 1e-3, "Gradient diff was too large")
loss = 8.961071014404297
grad =
 tensor([[ 0.1500, -0.2600, -0.1600,  ...,  0.0100,  0.1100,  0.0600],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        ...,
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255]])
autograd =
 tensor([[ 0.1500, -0.2600, -0.1600,  ...,  0.0100,  0.1100,  0.0600],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        ...,
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255],
        [-0.0636,  0.1103,  0.0679,  ..., -0.0042, -0.0467, -0.0255]])
diff = 4.3655032641254365e-05

Training the model with SGD¶

Generally, solving a machine-learning problem requires defining the following components:

  • A model: architecture (type of model) consisting of hyperparameters (e.g. number of hidden layers, number of classes, etc) which are set in advance and trainable parameters which we want to fit to data.
  • A loss function (sometimes denoted as a criterion): evaluates the model output on some data compared to ground truth.
  • An optimization scheme: specifies how the model should be updated to improve the loss. May also have hyperparameters.
  • A dataset: What to fit the model to. Usually the available data is split into training, validation and test sets.

Now that we have implemented our loss function and its gradient, we can finally train our model.

Implementation notes:

  • You'll find that when implementing your solutions it's wise to keep the above components separate, so that each one can be changed independently of the others.
  • In this exercise we've separated the loss and the dataset; for simplicity, however, we'll implement the model and the optimizer together.
  • As you'll see further on, PyTorch provides very effective mechanisms to implement all of these components in a decoupled manner.
  • Note that our loss implementation didn't include regularization. We'll add this during the training phase using the weight_decay parameter. The reason is that we prefer that the part of the loss which only depends on the model parameters be part of the optimizer, not the loss function (though both ways are possible). You'll see this pattern later on when you use PyTorch's optimizers in the torch.optim package.
  • In practice we use batches of samples from the training set when training the model, because usually the training set can't fit into memory. Using gradients computed on batches of data at a time is known as mini-batch stochastic gradient descent (SGD).
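The core of such a training loop can be sketched on synthetic data (an illustration only, not the required train() method; the names and hyperparameter values here are arbitrary, and the hinge-loss gradient is the one derived earlier):

```python
import torch

torch.manual_seed(0)
N, D1, C = 256, 10, 3
X = torch.randn(N, D1)                    # synthetic training set
y = torch.randint(0, C, (N,))
W = 0.01 * torch.randn(D1, C)             # small random initial weights
lr, weight_decay, batch_size = 0.1, 0.01, 32

for epoch in range(5):
    perm = torch.randperm(N)              # reshuffle every epoch
    for i in range(0, N, batch_size):
        idx = perm[i:i + batch_size]
        xb, yb = X[idx], y[idx]
        B = len(idx)
        S = xb @ W                        # forward pass on the batch
        M = 1.0 + S - S[torch.arange(B), yb].view(-1, 1)
        M[torch.arange(B), yb] = 0
        G = (M > 0).to(X.dtype)
        G[torch.arange(B), yb] = -G.sum(dim=1)
        grad = xb.t() @ G / B             # data-loss gradient for the batch
        # SGD step; weight_decay contributes the regularization gradient
        W = W - lr * (grad + weight_decay * W)
```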

TODO

  1. Implement the model training loop in the LinearClassifier's train() method. Use mini-batch SGD for the weight update rule.
  2. Update the training hyperparameters in the hyperparams function. You should play with the hyperparameters to get a feel for what they do to the loss and accuracy graphs.
In [10]:
hp = hw1linear.hyperparams()
print('hyperparams =', hp)

lin_cls = hw1linear.LinearClassifier(n_features, n_classes, weight_std=hp['weight_std'])

# Evaluate on the test set
x_test, y_test = dl_utils.flatten(dl_test)
y_test_pred, _ = lin_cls.predict(x_test)
test_acc_before = lin_cls.evaluate_accuracy(y_test, y_test_pred)

# Train the model
svm_loss_fn = SVMHingeLoss()
train_res, valid_res = lin_cls.train(dl_train, dl_valid, svm_loss_fn,
                                    learn_rate=hp['learn_rate'], weight_decay=hp['weight_decay'],
                                    max_epochs=30)

# Re-evaluate on the test set
y_test_pred, _ = lin_cls.predict(x_test)
test_acc_after = lin_cls.evaluate_accuracy(y_test, y_test_pred)

# Plot loss and accuracy
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
for i, loss_acc in enumerate(('loss', 'accuracy')):
    axes[i].plot(getattr(train_res, loss_acc))
    axes[i].plot(getattr(valid_res, loss_acc))
    axes[i].set_title(loss_acc.capitalize(), fontweight='bold')
    axes[i].set_xlabel('Epoch')
    axes[i].legend(('train', 'valid'))
    axes[i].grid(which='both', axis='y')
    
# Check test set accuracy
print(f'Test-set accuracy before training: {test_acc_before:.1f}%')
print(f'Test-set accuracy after training: {test_acc_after:.1f}%')
test.assertGreaterEqual(test_acc_after, 85.0)
hyperparams = {'weight_std': 0.05, 'learn_rate': 0.05, 'weight_decay': 0.01}
Training..............................
Test-set accuracy before training: 13.5%
Test-set accuracy after training: 88.0%

Even though this is a very naïve model, you should get at least 85% test set accuracy if you implemented training correctly. You can try to change the hyperparameters and see whether you get better results. Generally this should be done with cross-validation.

Visualization¶

One way to understand what models learn is to try to visualize their learned parameters. There can be many ways to do this. Let's try a very simple one, which is to reshape them into images of the input size and see what they look like.

TODO Implement the weights_as_images() function in the LinearClassifier class.

In [11]:
import cs236781.plot as plot

w_images = lin_cls.weights_as_images(img_shape=(1,28,28))
fig, axes = plot.tensors_as_images(list(w_images))

Additionally, we can better understand the model by plotting some samples and looking at wrong predictions. Run the following block to visualize some test-set examples and the model's predictions for them.

In [12]:
# Plot some images from the test set and their predictions
n_plot = 104
x_test, y_test = next(iter(dl_test))
x_test = x_test[0:n_plot]
y_test = y_test[0:n_plot]
y_test_pred, _ = lin_cls.predict(x_test)
x_test_img = torch.reshape(x_test[:, :-1], (n_plot, 1, 28, 28))

fig, axes = plot.tensors_as_images(list(x_test_img), titles=y_test_pred.numpy(),
                                   nrows=8, hspace=0.5, figsize=(10,8), cmap='gray')

# Highlight the wrong predictions
wrong_pred = y_test_pred != y_test
wrong_pred_axes = axes.ravel()[wrong_pred.numpy().astype(bool)]
for ax in wrong_pred_axes:
    ax.title.set_color('red')
    ax.title.set_fontweight('bold')

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.

In [13]:
from cs236781.answers import display_answer
import hw1.answers

Question 1¶

Explain why the selection of $\Delta > 0$ is arbitrary for the SVM loss $L(\mat{W})$ as it is defined above (the full in-sample loss, with the regularization term).

In [14]:
display_answer(hw1.answers.part3_q1)

Your answer:

$\Delta$ is a hyperparameter that determines the minimum margin between the score of the correct class and the score of any incorrect class. The choice of $\Delta > 0$ is arbitrary because $\lambda$ can also control the margin: scaling $\mat{W}$ scales all score differences, so any fixed margin $\Delta$ can be met by making the weights larger, and the weight magnitude is in turn governed by $\lambda$ (generally, a higher $\lambda$ means smaller weights and hence smaller effective margins). The two hyperparameters thus balance each other.

Question 2¶

Given the images in the visualization section above,

  1. How do you interpret what the linear model is actually learning? Can you explain some of the classification errors based on it?
  2. How is this interpretation similar or different from KNN?
In [15]:
display_answer(hw1.answers.part3_q2)

Your answer:

  1. It seems like the SVM model is learning the general (or average) shape of each class (digit).

We can see in the images that the bright areas (corresponding to the higher weights) form a shape similar (in most cases) to the digit itself. This is especially clear for the digits '0', '1' and '3'. Some classification errors can be explained by this: a sample whose shape happens to resemble the learned template of another digit will be misclassified.

  2. The similarity is that both models compare the shape of a new digit to the old ones (the training set).

The difference is that kNN compares the new digit to its closest neighbors (which can be interpreted as the most similar shapes) and then labels it by majority vote, while the SVM model learns an "average" template for each digit 0-9, compares the new digit to these templates, and selects the most similar one.

Question 3¶

  1. Based on the graph of the training set loss, would you say that the learning rate you chose is:
    • Too low
    • Good
    • Too High

Explain your answer by describing what the loss graph would look like in the other two cases when training for the same number of epochs.

  2. Based on the graph of the training and test set accuracy, would you say that the model is:
    • Slightly overfitted to the training set
    • Highly overfitted to the training set
    • Slightly underfitted to the training set
    • Highly underfitted to the training set

and why?

In [16]:
display_answer(hw1.answers.part3_q3)

Your answer:

  1. Based on the graph, we would say that the learning rate we chose ($0.05$) is Good.

That's because the graph shows a "nice" convergence to the minimum: the curve is overall smooth and decreases steadily towards the minimum.

  • For a too-high learning rate, we would see large fluctuations in the graph, or even divergence.
  • For a too-low learning rate, the curve would be smoother but would not reach the minimum within the given number of epochs.

  2. Based on the graph, we would say that the model is Slightly overfitted to the training set.

We can see a small gap between the train curve and the validation curve: the train accuracy is slightly higher than the validation accuracy, indicating that the model generalizes slightly worse than it fits the training data.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \renewcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \renewcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 4: Linear Regression¶

In this part we'll perform the classic machine learning task of linear regression. We'll do some simple data exploration and feature engineering, like in the pre-deep-learning days. Our solution will be implemented using some very widely used machine-learning python libraries (numpy, scikit-learn and pandas). We'll then explore the generalization capacity of the model and perform cross validation.

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import unittest

%matplotlib inline
%load_ext autoreload
%autoreload 2

plt.rcParams.update({'font.size': 14})
np.random.seed(42)
test = unittest.TestCase()

Dataset exploration¶

We'll be working with the Boston housing dataset. This is a famous toy dataset for benchmarking regression algorithms.

The dataset contains 506 samples of median house values in Boston, each with 13 associated house and neighborhood attributes (features; see link for their meaning). The 13 features of each house are our independent variables, and we're trying to predict the value of MEDV, the median house price (in units of $1000).

Run the following block to load the data. Since this dataset is very small, we can load it directly into memory and forgo any lazy-loading mechanisms.

In [2]:
import sklearn.datasets
import warnings

# Load data we'll work with - Boston housing dataset
# We'll use sklearn's built-in data
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    
    ds_boston = sklearn.datasets.load_boston()

feature_names = list(ds_boston.feature_names)

n_features = len(feature_names)
x, y = ds_boston.data, ds_boston.target
n_samples = len(y)
print(f'Loaded {n_samples} samples')
Loaded 506 samples

Let's use pandas to visualize the independent and target variables. We'll just show the first 10 samples.

In [3]:
# Load into a pandas dataframe and show some samples
df_boston = pd.DataFrame(data=x, columns=ds_boston.feature_names)
df_boston = df_boston.assign(MEDV=y)
df_boston.head(10).style.background_gradient(subset=['MEDV'], high=1.)
Out[3]:
  CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.006320 18.000000 2.310000 0.000000 0.538000 6.575000 65.200000 4.090000 1.000000 296.000000 15.300000 396.900000 4.980000 24.000000
1 0.027310 0.000000 7.070000 0.000000 0.469000 6.421000 78.900000 4.967100 2.000000 242.000000 17.800000 396.900000 9.140000 21.600000
2 0.027290 0.000000 7.070000 0.000000 0.469000 7.185000 61.100000 4.967100 2.000000 242.000000 17.800000 392.830000 4.030000 34.700000
3 0.032370 0.000000 2.180000 0.000000 0.458000 6.998000 45.800000 6.062200 3.000000 222.000000 18.700000 394.630000 2.940000 33.400000
4 0.069050 0.000000 2.180000 0.000000 0.458000 7.147000 54.200000 6.062200 3.000000 222.000000 18.700000 396.900000 5.330000 36.200000
5 0.029850 0.000000 2.180000 0.000000 0.458000 6.430000 58.700000 6.062200 3.000000 222.000000 18.700000 394.120000 5.210000 28.700000
6 0.088290 12.500000 7.870000 0.000000 0.524000 6.012000 66.600000 5.560500 5.000000 311.000000 15.200000 395.600000 12.430000 22.900000
7 0.144550 12.500000 7.870000 0.000000 0.524000 6.172000 96.100000 5.950500 5.000000 311.000000 15.200000 396.900000 19.150000 27.100000
8 0.211240 12.500000 7.870000 0.000000 0.524000 5.631000 100.000000 6.082100 5.000000 311.000000 15.200000 386.630000 29.930000 16.500000
9 0.170040 12.500000 7.870000 0.000000 0.524000 6.004000 85.900000 6.592100 5.000000 311.000000 15.200000 386.710000 17.100000 18.900000

Let's explore the data a bit by plotting a scatter matrix of every variable as a function of every other and a histogram for each.

In [4]:
pd.plotting.scatter_matrix(df_boston, figsize=(20,20));

The above chart shows us (among other things) how our target variable MEDV behaves as a function of the features (bottom row). By looking at it, can you guess which relationships might be good candidates for a linear model?

Let's use a simple method for deciding which features to use for our linear model: the correlation coefficient, defined as

$$ \rho_{\vec{x}\vec{y}} = \frac{\sigma_{\vec{x}\vec{y}}}{\sigma_{\vec{x}} \sigma_{\vec{y}}} = \frac {\sum_{i=1}^{N} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}}) } {\sqrt{\sum_{i=1}^{N} (x_i - \mu_{\vec{x}})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \mu_{\vec{y}})^2}} $$

Where $\vec{x}, \vec{y}$ are $N$ samples of two variables and $\mu, \sigma$ refer to sample means and (co-)variances respectively. The value of $\rho$ is $\pm 1$ for perfect positive or negative linear relationships ($y=ax+b$), and somewhere in between when it's not perfect. Note that this coefficient is rather limited: even when $\rho=0$, the variables may be highly dependent, just not in a linear fashion.
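Computing the coefficient directly is straightforward (a hypothetical helper for illustration; pandas' built-in `DataFrame.corr()` and numpy's `np.corrcoef` compute the same quantity):

```python
import numpy as np

def corr_coef(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.arange(10, dtype=float)
print(corr_coef(x, 2 * x + 1))    # perfect positive linear relation: rho ~ 1
print(corr_coef(x, -3 * x + 5))   # perfect negative linear relation: rho ~ -1
```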

Let's implement this method to find out which features we should include in our initial linear model.

TODO Implement the top_correlated_features() function in the hw1/linear_regression.py module.

In [5]:
import hw1.linear_regression as hw1linreg

n_top_features = 5
top_feature_names, top_corr = hw1linreg.top_correlated_features(df_boston, 'MEDV', n_top_features)
print('Top features: ', top_feature_names)
print('Top features correlations: ', top_corr)

# Tests
test.assertEqual(len(top_feature_names), n_top_features)
test.assertEqual(len(top_corr), n_top_features)
test.assertAlmostEqual(np.sum(np.abs(top_corr)), 2.893, delta=1e-3) # compare to precomputed value for n=5
Top features:  ['LSTAT', 'RM', 'PTRATIO', 'INDUS', 'TAX']
Top features correlations:  [-0.7376627261740151, 0.6953599470715393, -0.5077866855375616, -0.48372516002837357, -0.46853593356776685]

Linear Regression Model¶

Arguably the simplest machine learning model is linear regression. We are given a dataset $\left\{\vec{x}^{(i)}, y^{(i)}\right\}_{i=1}^{N}$ where $\vec{x}^{(i)} \in \set{R}^D$ is a $D$-dimensional feature vector and $y^{(i)}\in\set{R}$ is a continuous quantity assumed to be the output of some unknown function, i.e. $y^{(i)} = f(\vec{x}^{(i)})$.

Our goal will be to fit a linear transformation, parametrized by weights vector and bias term $\vec{w}, b$, such that given a sample $\vec{x}$ our prediction is

$$ \hat{y} = \vectr{w}\vec{x} + b. $$

We'll judge the performance of the model in the ordinary least-squares sense, i.e. with a loss function given by the mean-squared error (MSE) plus an L2-regularization term: $$ L(\vec{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2 = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \vectr{w}\vec{x}^{(i)} - b \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2. $$

Minimizing the above $L(\vec{w})$ is a simple convex optimization problem with a closed-form solution. Of course, this can also be solved using iterative descent methods which are necessary when the data is too large to fit in memory.

As a warm-up with numpy, let's implement the bias trick again (this time using numpy, as a sklearn transformation) so that our linear regression model will operate on data with an added bias term.

TODO Implement the class BiasTrickTransformer in the hw1/linear_regression.py module.
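A minimal sketch of such a transformer, assuming the standard sklearn `BaseEstimator`/`TransformerMixin` pattern (the class name and details below are illustrative, not the reference solution):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class BiasTrickSketch(BaseEstimator, TransformerMixin):
    """Prepends a column of ones to the input, implementing the bias trick."""

    def fit(self, X, y=None):
        # Stateless transform: there is nothing to learn from the data
        return self

    def transform(self, X):
        X = np.asarray(X)
        ones = np.ones((X.shape[0], 1), dtype=X.dtype)
        return np.hstack([ones, X])
```

Because `TransformerMixin` provides `fit_transform`, this drops straight into an sklearn pipeline.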

In [6]:
# Test BiasTrickTransformer
bias_tf = hw1linreg.BiasTrickTransformer()

test_cases = [
    np.random.randint(10, 20, size=(5,2)),
    np.random.randn(10, 1),
]

for xt in test_cases:
    xb = bias_tf.fit_transform(xt)
    print(xb.shape)
    
    test.assertEqual(xb.ndim, 2)
    test.assertTrue(np.all(xb[:,0] == 1))
    test.assertTrue(np.all(xb[:, 1:] == xt))
    
(5, 3)
(10, 2)

Let's now define a function to assess the accuracy of our model's predictions (loss and score). We'll use the MSE loss as above and $R^2$ as a score. Note that $R^2$ is at most 1 (and can even be negative for very poor fits); it represents how much better the regression fits the data compared to simply predicting the data mean. It is given by $$ R^2 = 1-\frac{\sum_i (e^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}, $$ where $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ is known as the residual of sample $i$ and $\bar{y}$ is the data mean.

TODO Implement the mse_score and r2_score function in the hw1/linear_regression.py module.
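Both metrics follow directly from the formulas above. A minimal numpy sketch (the names mirror the TODO but are only an illustration):

```python
import numpy as np

def mse_score_sketch(y, y_pred):
    # Mean of the squared residuals
    return np.mean((y - y_pred) ** 2)

def r2_score_sketch(y, y_pred):
    # 1 - RSS/TSS: residual sum of squares over total sum of squares
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```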

In [7]:
def evaluate_accuracy(y: np.ndarray, y_pred: np.ndarray):
    """
    Calculates mean squared error (MSE) and coefficient of determination (R-squared).
    :param y: Target values.
    :param y_pred: Predicted values.
    :return: A tuple containing the MSE and R-squared values.
    """
    mse = hw1linreg.mse_score(y, y_pred)
    rsq = hw1linreg.r2_score(y, y_pred)
    return mse, rsq

Of course, these measures and many others are built into sklearn. We'll use them to test our implementation.

In [8]:
from sklearn.metrics import r2_score as r2, mean_squared_error as mse

for i in range(10):
    test_y = np.random.randn(20)
    test_y_pred = np.random.randn(20)
    
    mse_actual, r2_actual = evaluate_accuracy(test_y, test_y_pred)
    mse_expected, r2_expected = mse(test_y, test_y_pred), r2(test_y, test_y_pred)
    
    test.assertAlmostEqual(mse_actual, mse_expected, delta=1e-6)
    test.assertAlmostEqual(r2_actual, r2_expected, delta=1e-6)

Now we can implement our model.

TODO Based on the above equations for the model and loss, implement the predict() and fit() functions in the LinearRegressor class within the module linear_regression.py. You'll need to first derive the closed-form solution for the optimal $\vec{w}$ based on the loss. Run the code block below to fit your model to each of the 5 top features you selected (one at a time).
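Setting $\nabla_{\vec{w}} L = 0$ gives the closed form $\left(\frac{1}{N}\mat{X}^\top\mat{X} + \lambda \mat{I}\right)\vec{w} = \frac{1}{N}\mat{X}^\top \vec{y}$. Here's a numpy sketch of it, assuming `X` already contains a bias column from the bias trick; excluding the bias weight from the regularization is a common convention, not necessarily what the reference solution does:

```python
import numpy as np

def fit_ridge_sketch(X, y, reg_lambda=0.1):
    # Solve (X^T X / N + lambda * I) w = X^T y / N for w.
    # X is assumed to have the bias column as its first column.
    N, D = X.shape
    I = np.eye(D)
    I[0, 0] = 0  # common convention: don't regularize the bias weight
    A = X.T @ X / N + reg_lambda * I
    return np.linalg.solve(A, X.T @ y / N)
```

With $\lambda=0$ this reduces to the ordinary least-squares solution; increasing $\lambda$ shrinks the non-bias weights toward zero.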

A very useful feature of sklearn is pipelines: we can create a composite model made of multiple steps that transform the features (using fit_transform()) and a final step that calculates the actual model predictions (using fit_predict()). Each step in the pipeline should be an sklearn Estimator instance and implement the appropriate methods.

For example, let's create a pipeline that scales each input feature to zero mean and unit variance, applies our bias-trick transformation and finally uses our linear regression model.

In [9]:
import sklearn.preprocessing
import sklearn.pipeline

# Create our model as a pipeline:
# First we scale each feature, then the bias trick is applied, then the regressor
model = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    hw1linreg.BiasTrickTransformer(),
    hw1linreg.LinearRegressor(),
)

# Test the model implementation is correct
y_pred = model.fit_predict(x, y)
full_dataset_mse, _ = evaluate_accuracy(y, y_pred)
test.assertEqual(y_pred.shape, y.shape)
test.assertAlmostEqual(full_dataset_mse, 22.660, delta=1e-1)

From here on, we'll use our pipeline as the model.

We now want to check the predictive power of different features. First, we'll implement a small helper function that lets us fit the model on a subset of features from our dataframe.

TODO Implement the fit_predict_dataframe function in the linear_regression.py module.
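A possible sketch of this helper; defaulting to all non-target columns is an assumption about the intended behavior:

```python
import pandas as pd

def fit_predict_dataframe_sketch(model, df, target_name, feature_names=None):
    # Default to every column except the target
    if feature_names is None:
        feature_names = [c for c in df.columns if c != target_name]
    # .values hands the model plain ndarrays (avoids the feature-name
    # warnings sklearn emits when mixing DataFrames and arrays)
    X = df[feature_names].values
    y = df[target_name].values
    return model.fit_predict(X, y)
```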

In [10]:
# Full dataset
y_pred = hw1linreg.fit_predict_dataframe(
    model, df_boston, target_name='MEDV'
)
test.assertAlmostEqual(full_dataset_mse, evaluate_accuracy(y,y_pred)[0], delta=1e-1)

# Subset of features
y_pred = hw1linreg.fit_predict_dataframe(
    model, df_boston, target_name='MEDV', feature_names=['CHAS', 'B']
)
test.assertAlmostEqual(72.982, evaluate_accuracy(y,y_pred)[0], delta=1e-1)

We'll use each feature separately and fit multiple times to get an idea of the predictive power of each of our top-5 features.

In [11]:
fig, ax = plt.subplots(nrows=1, ncols=n_top_features, sharey=True, figsize=(20,5))
actual_mse = []

# Fit a single feature at a time
for i, feature_name in enumerate(top_feature_names):
    y_pred = hw1linreg.fit_predict_dataframe(model, df_boston, 'MEDV', [feature_name])
    mse, rsq = evaluate_accuracy(y, y_pred)

    # Plot
    xf = df_boston[feature_name].values.reshape(-1, 1)
    x_line = np.arange(xf.min(), xf.max(), 0.1, dtype=float).reshape(-1, 1)
    y_line = model.predict(x_line)
    ax[i].scatter(xf, y, marker='o', edgecolor='black')
    ax[i].plot(x_line, y_line, color='red', lw=2, label=f'fit, $R^2={rsq:.2f}$')
    ax[i].set_ylabel('MEDV')
    ax[i].set_xlabel(feature_name)
    ax[i].legend(loc='upper right')
    
    actual_mse.append(mse)

# Test regressor implementation
print(actual_mse)
expected_mse = [38.862, 43.937, 62.832, 64.829, 66.040]
for i in range(len(expected_mse)):
    test.assertAlmostEqual(expected_mse[i], actual_mse[i], delta=1e-1)
/home/lotanamit5/miniconda3/envs/deep_env/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
[38.862608460689785, 43.93789891484722, 62.83209551907834, 64.82947233954711, 66.0404346824534]

As you can see, the results are not great. We can't reliably predict the target variable based on just one of these. Now let's fit a model based on the combined top-5 features. Since it's difficult to visualize high-dimensional hyperplanes, instead of plotting the data and fitted hyperplane, we'll create a residuals plot. This is the plot of the error, or residual $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ vs. the predicted value $\hat{y}^{(i)}$.

In [12]:
# Fit top-5 features
y_pred = hw1linreg.fit_predict_dataframe(model,  df_boston, 'MEDV', top_feature_names)
mse5, rsq5 = evaluate_accuracy(y, y_pred)
print(f'mse5={mse5:.2f}, rsq5={rsq5:.2f}')

# Residuals plot
def plot_residuals(y, y_pred, ax=None, res_label=None):
    if ax is None:
        _, ax = plt.subplots()
    res = y - y_pred
    ax.scatter(y_pred, res, marker='s', edgecolor='black', label=res_label)
    ax.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), color='red', lw=3)
    ax.hlines(y=[-res.std(), res.std()], xmin=y_pred.min(), xmax=y_pred.max(),
              color='red', lw=3, linestyles=':')
    ax.set_xlabel(r'$\hat{y}$')
    ax.set_ylabel(r'$y - \hat{y}$')
    if res_label is not None:
        ax.legend()
    return ax

plot_residuals(y, y_pred)

# Sanity test
test.assertLess(mse5, 30)
mse5=27.20, rsq5=0.68

That's better, but it still leaves something to be desired. Let's try to improve our model further.

Adding nonlinear features¶

We can see from the scatter matrix that some of the relationships between our features and the target variable are clearly not linear and cannot be fully modeled by fitting lines (or hyperplanes). Is there a way to fit a non-linear function to the data (such as a polynomial) while still enjoying the simplicity of the linear regression model?

Suppose we have 2-dimensional feature vectors, $\vec{x}=(x_1, x_2)$. We can fit a linear regression model with 3 parameters which represents some 2-d plane. However if we transform each such feature vector, for example by $\vec{\tilde{x}} = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, then we can now fit a model with 6 parameters to the same data. We can thus increase the capacity of our model (its ability to fit a wide variety of functions) by adding more parameters that correspond to non-linear transformations of the features.

Let's implement some hand-crafted nonlinear features based on all the features in the dataset. This step in the machine learning process is sometimes referred to as feature engineering. In the rest of the course, you'll see how Deep Learning allows us to learn the features themselves instead of crafting them by hand, yielding very powerful representations.

TODO Implement the BostonFeaturesTransformer class in the hw1/linear_regression.py module. You can create any features you want, for example given $\vec{x}=(x_1,x_2)$ you could generate features such as $x_1^2$, $x_1 \log{x_2}$, $e^{-x_1}$ and so on. Try to "engineer" features by inferring relationships based on the scatter matrix.

Notes:

  • You can use the class PolynomialFeatures from sklearn.preprocessing to simplify generation of polynomial features.
  • Dropping a feature is also a form of feature engineering. You can and should discard features that aren't helpful.
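As a hedged illustration, a transformer along these lines could combine polynomial terms with a log-compressed copy of each feature. The class name, the exact features and the `degree` parameter below are assumptions for the sketch; the real `BostonFeaturesTransformer` can differ:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PolynomialFeatures

class FeaturesTransformerSketch(BaseEstimator, TransformerMixin):
    """Polynomial features plus a log-compressed copy of each input feature."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        self.poly_ = PolynomialFeatures(degree=self.degree, include_bias=False)
        self.poly_.fit(X)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        X_poly = self.poly_.transform(X)
        # log1p compresses heavy-tailed features; abs guards negative inputs
        X_log = np.log1p(np.abs(X))
        return np.hstack([X_poly, X_log])
```

Exposing `degree` as a constructor argument makes it tunable later via cross-validation (`set_params`).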
In [13]:
def linreg_boston(model, x, y, fit=True):
    if fit:
        model.fit(x, y)
    y_pred = model.predict(x)
    mse, rsq = evaluate_accuracy(y, y_pred)
    return y_pred, mse, rsq

# Fit with all features this time
x = df_boston[feature_names].values

# Use model with a custom features transform
model = sklearn.pipeline.make_pipeline(
    hw1linreg.BiasTrickTransformer(),
    hw1linreg.BostonFeaturesTransformer(),
    hw1linreg.LinearRegressor()
)

y_pred, mse, rsq = linreg_boston(model, x, y)
plot_residuals(y, y_pred)

# Test: You should get at least 2x lower loss than previously, easily even lower
print(f'target_mse={mse5/2:.3f}')
print(f'mse={mse:.2f}, rsq={rsq:.2f}')
test.assertLess(mse, mse5 / 2)
target_mse=13.601
mse=7.53, rsq=0.91

Generalization¶

By now, your model should produce fairly accurate predictions. Note, however, that we trained it on the entire Boston dataset.

When training models, we don't actually care about their performance on the training data; we're not interested in solving optimization problems. What we want is the ability to generalize: How well will it perform on novel, unseen data? In other words, did the model learn some function similar to the one actually generating the samples?

Let's find out how good our model is for unseen data the usual way: We'll split our dataset into a training and test set.

In [14]:
from sklearn.model_selection import train_test_split

# Data and model
x = df_boston[feature_names].values
y = df_boston['MEDV'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = sklearn.pipeline.make_pipeline(
    hw1linreg.BiasTrickTransformer(),
    hw1linreg.BostonFeaturesTransformer(),
    hw1linreg.LinearRegressor()
)

However, instead of just fitting the model on the training set and evaluating on the test set, we'll use cross-validation to find a set of model hyperparameters that allow the model to generalize well.

We'll again use k-fold CV: the training set is split into $k$ folds, and for each combination of hyperparameters being tested, each fold is in turn treated as a validation set while the model is fitted to the remaining folds. This time, however, we have more hyperparameters to search over.

TODO Implement the cv_best_hyperparams() function in the hw1/linear_regression.py module.
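One way to sketch this function is an exhaustive grid over both hyperparameters, scoring each combination by its mean validation MSE over the $k$ folds. The pipeline parameter names below follow sklearn's `<step>__<param>` convention, matching the printout further down; treating them as the only tunable parameters is an assumption of this sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_best_hyperparams_sketch(model, X, y, k_folds, degree_range, lambda_range):
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
    best_params, best_mse = None, np.inf
    for degree in degree_range:
        for reg_lambda in lambda_range:
            params = {
                'bostonfeaturestransformer__degree': degree,
                'linearregressor__reg_lambda': reg_lambda,
            }
            model.set_params(**params)
            fold_mse = []
            # Each fold serves once as the validation set
            for train_idx, val_idx in kf.split(X):
                model.fit(X[train_idx], y[train_idx])
                y_val = model.predict(X[val_idx])
                fold_mse.append(np.mean((y[val_idx] - y_val) ** 2))
            if np.mean(fold_mse) < best_mse:
                best_mse, best_params = np.mean(fold_mse), params
    return best_params
```

sklearn's `GridSearchCV` does essentially the same thing; the manual loop just makes the fold structure explicit.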

In [15]:
# Define search-spaces for hyper parameters
degree_range = np.arange(1, 4)
lambda_range = np.logspace(-3, 2, base=10, num=20)

# Use cross-validation to find best combination of hyperparameters
best_hypers = hw1linreg.cv_best_hyperparams(
    model, x_train, y_train, k_folds=3,
    degree_range=degree_range, lambda_range=lambda_range
) 

print('Best hyperparameters: ', best_hypers)

# Make sure returned params exist in the model
for param in best_hypers.keys():
    test.assertIn(param, model.get_params())
Best hyperparameters:  {'bostonfeaturestransformer__degree': 2, 'linearregressor__reg_lambda': 0.23357214690901212}

Now let's use the best hyperparameters to train a model on the full training set and evaluate it on the test set.

In [16]:
# Use the best hyperparameters
model.set_params(**best_hypers)

# Train best model on full training set
y_pred_train, mse, rsq = linreg_boston(model, x_train, y_train)
print(f'train: mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_train, y_pred_train, res_label='train')

# Evaluate on test set
y_pred_test, mse, rsq = linreg_boston(model, x_test, y_test, fit=False)
print(f'test:  mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_test, y_pred_test, ax=ax, res_label='test')

# Make sure test-set accuracy is good
test.assertLess(mse, 20) # You should be able to get way below this
train: mse=7.04, rsq=0.92
test:  mse=18.11, rsq=0.74

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.

In [17]:
from cs236781.answers import display_answer
import hw1.answers

Question 1¶

What's the ideal pattern to see in a residual plot? Based on the residual plots you got above, what can you say about the fit of the trained model? Compare the plot for the top-5 features with the final plot after CV.

In [18]:
display_answer(hw1.answers.part4_q1)

Your answer:

The ideal pattern in a residual plot is points scattered tightly and randomly around the horizontal zero line: a residual of 0 means a perfect prediction, and the absence of structure suggests the model captured the systematic part of the data.

Comparing the two plots:

| Plot           | MSE   | $R^2$ |
|----------------|-------|-------|
| top-5          | 27.20 | 0.68  |
| train after CV | 7.04  | 0.92  |
| test after CV  | 18.11 | 0.74  |

In the plots, we can see far fewer outliers outside the low-residual band. Also, inside that band the samples' errors are relatively closer to 0 than before. On top of that, the table above shows empirically that both the MSE and $R^2$ scores improved for the train and test sets compared to the top-5 fit.

Question 2¶

Explain the effect of adding non-linear features to our data.

  1. Is this still a linear regression model? Why or why not?
  2. Can we fit any non-linear function of the original features with this approach?
  3. Imagine a linear classification model. As we saw in Part 3, the parameters $\mat{W}$ of such a model define a hyperplane representing the decision boundary. How would adding non-linear features affect the decision boundary of such a classifier? Would it still be a hyperplane? Why or why not?
In [19]:
display_answer(hw1.answers.part4_q2)

Your answer:

  1. This is still a linear regression model: we transformed the features, but the regression itself remains linear in its parameters (the fitted surface is still a hyperplane, just in a space whose dimension has changed).

  2. Not every non-linear function, but a very wide family. In Introduction to ML we learned about the RBF kernel, a feature transformation expressive enough to approximate essentially any continuous function (with the drawback of a tendency to overfit).

  3. The decision boundary will still be a hyperplane, but a hyperplane in the space of the transformed features rather than the original ones. Mapped back to the original feature space, this results in a non-linear decision boundary.

Question 3¶

Regarding the cross-validation:

  1. When defining the range for $\lambda$ in the CV code above, why do you think we used np.logspace instead of np.linspace? Explain the advantage for CV.
  2. How many times in total was the model fitted to data (with the parameters as given, and not including the final fit on the entire training set)?
In [20]:
display_answer(hw1.answers.part4_q3)

Your answer:

  1. np.logspace creates a range of values that are evenly spaced on a logarithmic scale, while np.linspace creates values that are evenly spaced on a linear scale.

Using np.logspace lets us cover several orders of magnitude of $\lambda$ with only a few candidate values, making the search far more effective per value tested. The advantage of CV is that it evaluates each hyperparameter combination on held-out data without touching the test set; moreover, by averaging performance over the folds, it reduces the risk of judging hyperparameters on a single unrepresentative validation split. This lets us tune hyperparameters that are close to optimal.

  2. The degree range has length 3 and the lambda range has length 20, so the grid search tested $3 \cdot 20 = 60$ combinations.

Each combination was fitted and evaluated 3 times (once per fold), resulting in $3 \cdot 20 \cdot 3 = 180$ fits of the model.